library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v stringr 1.4.0
## v tidyr   1.1.4     v forcats 0.5.1
## v readr   2.0.2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidyr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(RColorBrewer)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout




Introduction

Aviation saftey is a topic that hits close to home, as I recently lost a co-worker in an airplane crash. For this project, my goal is to explore airplane crashes worldwide to find out how often they are occurring, what manufacturer has the most, and how many fatalities are there as a result. For my project, I am using the Airplane Crash Data Since 1908 dataset, which is a collection of the number of crashes from 1908 till 2019.

The Dataset

#My data 
crashes <- read.csv("../data/airplanedata.csv")
head(crashes)
##         Date  Time                           Location               Operator
## 1  9/17/1908 17:18                Fort Myer, Virginia   Military - U.S. Army
## 2   9/7/1909                  Juvisy-sur-Orge, France                       
## 3  7/12/1912  6:30          Atlantic City, New Jersey   Military - U.S. Navy
## 4   8/6/1913       Victoria, British Columbia, Canada                Private
## 5   9/9/1913 18:30                 Over the North Sea Military - German Navy
## 6 10/17/1913 10:30         Near Johannisthal, Germany Military - German Navy
##   Flight..         Route                AC.Type Registration cn.ln Aboard
## 1          Demonstration       Wright Flyer III                  1      2
## 2               Air show         Wright Byplane          SC1            1
## 3            Test flight              Dirigible                         5
## 4                              Curtiss seaplane                         1
## 5                        Zeppelin L-1 (airship)                        20
## 6                        Zeppelin L-2 (airship)                        30
##   Aboard.Passangers Aboard.Crew Fatalities Fatalities.Passangers
## 1                 1           1          1                     1
## 2                 0           1          1                     0
## 3                 0           5          5                     0
## 4                 0           1          1                     0
## 5              NULL        NULL         14                  NULL
## 6              NULL        NULL         30                  NULL
##   Fatalities.Crew Ground
## 1               0      0
## 2               0      0
## 3               5      0
## 4               1      0
## 5            NULL      0
## 6            NULL      0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Summary
## 1 During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximately 75 feet, killing Lt. Thomas E. Selfridge, 26, who was a passenger. This was the first recorded airplane fatality in history.  One of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft.  Orville Wright suffered broken ribs, pelvis and a leg.  Selfridge suffered a crushed skull and died a short time later.
## 2                                                                                                                                                                                                                                                                                                                                                                                                 Eugene Lefebvre was the first pilot to ever be killed in an air accident, after his controls jambed while flying in an air show.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                              First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight.
## 4                                                                                                                                                                                                                                                                                                                                                                                                   The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed.
## 5                                                                                                                                                                                                                                                                                                             The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the sea. The ship broke in two and the control car immediately sank drowning its occupants.
## 6                                                                                                                                                                                                                                                                                                                                                                                         Hydrogen gas which was being vented was sucked into the forward engine and ignited causing the airship to explode and burn at 3,000 ft..




Data Wrangling and Exploration

#Separating the Date column into 3 different ones to perform a better analysis
df <- separate(data = crashes,col = Date, into = c("Month", "Day", "Year"), sep = "\\/")
#Selecting the columns that I need to perform my analysis
date_df <- select(df, c(Month, Year, Fatalities, Aboard, AC.Type))
#Changing the data type of fatalities to a numeric
date_df$Fatalities <- as.numeric(date_df$Fatalities)
## Warning: NAs introduced by coercion
#Changing all the NAs to 0s
date_df[is.na(date_df)] <- 0
#Changing the data type of year to a numeric
date_df$Year <- as.numeric(date_df$Year)
#Creating a new column in our dataset called decade
date_df <- date_df %>% 
  mutate(Decade = floor(Year/10) * 10)
#Summing all the fatalities per Year
year_df <- date_df %>% 
  group_by(Year) %>% 
  summarise(Fatalities = sum(Fatalities))
#Summing all the fatalities per Decade
decade_df <- date_df %>% 
  group_by(Decade) %>% 
  summarise(Fatalities = sum(Fatalities))
#Creating a new dataset, that has no NAs
df_country <- crashes
df_country[is.na(df_country)] <- ""

#grouping them by Location, and tallying them up, and selecting the locations that had more than 10 crashes
df_country1 <- df_country %>% 
  group_by(Location) %>% 
  tally() %>% 
  filter(n >= 10)

#splitting the location couling into city, region, and country
df_country1$Loc <- df_country1$Location
df_country1 <- separate(data = df_country1, col = Location, into = c("City", "Region", "Country"), sep = "\\,")
df_country1[is.na(df_country1)] <- ""
#Sub-dataset 
small_DF <- select(crashes, c(AC.Type, Aboard))
#Changing Aboard to a numeric
small_DF$Aboard <- as.numeric(small_DF$Aboard)
## Warning: NAs introduced by coercion
small_DF <- small_DF %>%
  #Getting the first word of the manufacturer to use for later analysis
  separate(col = AC.Type, into = c("Manufacturer"), sep = "\\ ") %>% 
  #grouping by the manufacturer
  group_by(Manufacturer) %>% 
  #adding up the total aboard per manufacturer
  summarise(x = Aboard) %>% 
  arrange(desc(x)) %>% 
  #filtering out flights that had less than 250 because I felt that would skew the results
  filter(x > 250)
## Warning: Expected 1 pieces. Additional pieces discarded in 4896 rows [1, 2, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, ...].
## `summarise()` has grouped output by 'Manufacturer'. You can override using the `.groups` argument.
small_DF <- small_DF %>%
  group_by(Manufacturer) %>% 
  summarise(x = sum(x)) %>% 
  arrange(desc(x)) %>% 
  filter(x > 250)

small_DF
## # A tibble: 5 x 2
##   Manufacturer     x
##   <chr>        <dbl>
## 1 Boeing        6590
## 2 McDonnell     3853
## 3 Airbus        1990
## 4 Lockheed      1214
## 5 Ilyushin       532
#Changing the data type of Aboard to a numeriv
date_df$Aboard <- as.numeric(date_df$Aboard)
## Warning: NAs introduced by coercion
df_fatalityRate <- date_df %>%
  separate(col = AC.Type, into = c("Manufacturer"), sep = "\\ ") %>% 
  #filtering for manufacturers that are in the dataset I created in the last chunk
  filter(Manufacturer %in% small_DF$Manufacturer) %>%
  group_by(Manufacturer, Year) %>% 
  #summing the totals across all flights per manufacturer 
  #I did it this way as I believe that this is more accurate then taking the average.
  summarise(total_Fatalities = sum(Fatalities), total_Aboard = sum(Aboard)) %>% 
  #calculating the fatality rate
  mutate(Fatality_Rate = (total_Fatalities/total_Aboard) * 100)
## Warning: Expected 1 pieces. Additional pieces discarded in 4896 rows [1, 2, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, ...].
## `summarise()` has grouped output by 'Manufacturer'. You can override using the `.groups` argument.
df_fatalityRate
## # A tibble: 294 x 5
## # Groups:   Manufacturer [5]
##    Manufacturer  Year total_Fatalities total_Aboard Fatality_Rate
##    <chr>        <dbl>            <dbl>        <dbl>         <dbl>
##  1 Airbus        1976                7          258          2.71
##  2 Airbus        1987                5            5        100   
##  3 Airbus        1988              293          426         68.8 
##  4 Airbus        1990               92          146         63.0 
##  5 Airbus        1992              367          376         97.6 
##  6 Airbus        1993                2           70          2.86
##  7 Airbus        1994              349          752         46.4 
##  8 Airbus        1995               60           60        100   
##  9 Airbus        1997              234          234        100   
## 10 Airbus        1998              298          469         63.5 
## # ... with 284 more rows




Data Analysis

Has the number of airplane creashes increased or decreased over the years/decades?

#creating the plot
ggplot() +
  
  #adding the line that represents fatalities per year
  geom_line(year_df, mapping = aes(x = Year, y = Fatalities), color = "Blue")+
  
  #adding the line that represents fatalities per decade
  geom_line(decade_df, mapping = aes(x = Decade, y = Fatalities/10), color = "Red")+
  
  #adding title, subtitle, and caption
  labs(title = "Fatalities by Year and Decade",
       subtitle = "Worldwide Airplane Fatalities from 1908 till 2019",
       caption = "Data Source: Kaggle.com") +
  geom_smooth()+
  
  #making the dual axis, and adding commas to the y-axis scales
  scale_y_continuous(
    name = "Annual Fatalities",
    labels = scales::comma,
    sec.axis = sec_axis(~.*10, name = "Decade Fatalities", labels = scales::comma))+
  
  theme_minimal() +
  
  #making the coloring of the y-axis match the color of their respective lines
  theme(axis.text.y.left = element_text(color = "blue")) +
  theme(axis.text.y.right = element_text(color = "red")) +
  
  #positioning the title left aligned, not aligned with the plot
  theme(plot.title.position = "plot")

#For this graph, I tried adding the colors to the title of the graph to match the lines, but i could not figure it out.

In this graph, you can see that based on the years, fatalities peaked just before 1975. It then had a slight decrease. Based off of the decades, it is showing the peak fatalities around the same time, but it shows that overall the Fatalities were steadly decreasing.




What Regions have the most amount of crashes?

df_country1 %>% 
  ggplot(aes(x = reorder(Loc, n), y = n, fill = Region)) +
    geom_col() +
    scale_fill_brewer(palette = "Blues")+
    coord_flip() +
    labs(title = "Regions with the Most Crashes",
         y = "Number of Crashes", 
         x = "") +
    geom_text(aes(label = n), hjust = -0.5)+
  theme_minimal() +
  guides(fill = FALSE) +
  theme(plot.title.position = "plot")

> This graph was shows the regions with the most common plane crashes. I think it is intresting that Russia has a decent size gap over New York. I say decent because as I was looking at this graph I noticed that there was a location called “near Moscow”, so adding that to the other bar of Russia, it has a total of 28 plane crashes.




What is the Fatality Rate per Year by Manufacturer?

#The commented out lines didn't work. I believe it has to do with me using plotly. But if I wasnt, I would have left aligned my title, and added a caption.
f <- df_fatalityRate %>% 
  ggplot(data = df_fatalityRate, mapping = aes(x = Manufacturer, y = Year, fill = Fatality_Rate))+
  labs(title = "Fatality Rate by Manufacturer") +
       #caption = "Data Source: kaggle.com") +
  geom_tile() +
  guides(fill = FALSE)
  #theme(plot.title.position = "plot")
ggplotly(f)

This graph is showing the top manufacturers, and the fatality rates per year. The darker the color, the lower the fatality rate. I mention this because I want to compare Boeing and Lockheed Martin. They are two major companies and as you can see Lockheeds fatality rate is higher. (I guess that is a good thing, as boeing supplies the commericial aircraft.)




Future Works

I really enjoyed exploring this dataset, and I wish I had more time. Some ideas that I wish to add:

  1. exploring how the operator of the flight could have effected the outcome of the flight. I noticed there was a lot of military operators, both foreign and domestic.
  2. use data mining to explore the summary of the crash to see how many were truly accidents vs. planned attacks.